Abstract. We study the problem of Byzantine-robust topology discovery
in an arbitrary asynchronous network. We formally state the weak
and strong versions of the problem. The weak version requires that ei- 
ther each node discovers the topology of the network or at least one node 
detects the presence of a faulty node. The strong version requires that 
each node discovers the topology regardless of faults. 

We focus on non-cryptographic solutions to these problems. We explore 
their bounds. We prove that the weak topology discovery problem is 
solvable only if the connectivity of the network exceeds the number of 
faults in the system. Similarly, we show that the strong version of the 
problem is solvable only if the network connectivity is more than twice 
the number of faults. 

We present solutions to both versions of the problem. The presented 
algorithms match the established graph connectivity bounds. The algo- 
rithms do not require the individual nodes to know either the diameter 
or the size of the network. The message complexity of both programs is 
low polynomial with respect to the network size. We describe how our 
solutions can be extended to add the property of termination, handle 
topology changes and perform neighborhood discovery. 



1 Introduction 

In this paper, we investigate the problem of Byzantine-tolerant distributed topol- 
ogy discovery in an arbitrary network. Each node is only aware of its neighboring 
peers and it needs to learn the topology of the entire network. 
Topology discovery is an essential problem in distributed computing (e.g.,
see [21]). It has direct applicability in practical systems. For example, link-state
based routing protocols such as OSPF use topology discovery mechanisms to 
compute the routing tables. Recently, the problem has come to the fore with 
the introduction of ad hoc wireless sensor networks, such as Berkeley motes [8], 
where topology discovery is indispensable for routing decisions. 

As reliability demands on distributed systems increase, the interest in de- 
veloping robust topology discovery programs grows. One of the strongest fault 
models is Byzantine [11]: the faulty node behaves arbitrarily. This model
encompasses a rich set of fault scenarios. Moreover, Byzantine fault tolerance has
security implications, as the behavior of an intruder can be modeled as Byzan- 
tine. One approach to deal with Byzantine faults is by enabling the nodes to use 
cryptographic operations such as digital signatures or certificates. This limits the 
power of a Byzantine node as a non-faulty node can verify the validity of received 
topology information and authenticate the sender across multiple hops. However, 
this option may not be available. For example, wireless sensors may not have 
the capacity to manipulate digital signatures. Another way to limit the power of 
a Byzantine process is to assume synchrony: all processes proceed in lock-step. 
Indeed, if a process is required to send a message with each pulse, a Byzantine 
process cannot refuse to send a message without being detected. However, the 
synchrony assumption may be too restrictive for practical systems. 

Our contribution. In this study we explore the fundamental properties of
topology discovery. We select the weakest practical programming model, establish
the limits on the solutions and present the programs matching those limits.

Specifically, we consider networks of arbitrary topology where up to a
fixed number k of nodes are faulty. The execution model is asynchronous. We are
interested in solutions that do not use cryptographic primitives. The solutions
should be terminating and the individual processes should not be aware of the
network parameters such as network diameter or its total number of nodes.

We state two variants of the topology discovery problem: weak and strong. 
In the former, either each non-faulty node learns the topology of the network
or one of them detects a fault; in the latter, each non-faulty node has to learn
the topology of the network regardless of the presence of faults.

As negative results we show that any solution to the weak topology discovery
problem cannot ascertain the presence of an edge between two faulty nodes.
Similarly, any solution to the strong variant cannot determine the presence
of an edge between a pair of nodes at least one of which is faulty. Moreover, the
solution to the weak variant requires the network to be at least (k + 1)-connected.
In case of the strong variant the network must be at least (2k + 1)-connected.

The main contributions of this study are the algorithms that solve the two
problems: Detector and Explorer. The algorithms match the respective connec- 
tivity lower bounds. To the best of our knowledge, these are the first asyn- 
chronous Byzantine-robust solutions to the topology discovery problem that do 
not use cryptographic operations. Explorer solves the stronger problem. However,
Detector has better message complexity. Detector either determines topology
or signals fault in O(δn^3) messages where δ and n are the maximum neighborhood
size and the number of nodes in the system respectively. Explorer finishes
in O(n^4) messages. We extend our algorithms to (a) terminate (b) handle
topology changes (c) discover neighbors if ports are known (d) discover a fixed 
number of routes instead of complete topology and (e) reliably propagate arbi- 
trary information instead of topological data. 

Related work. A number of researchers employ cryptographic operations to 
counter Byzantine faults. Avromopolus et al [2] consider the problem of secure 
routing. Therein see the references to other secure routing solutions that rely 
on cryptography. Perrig et al [19] survey robust routing methods in ad hoc 
sensor networks. The techniques covered there also assume that the processes 
are capable of cryptographic operations. 

A naive approach of solving the topology discovery problem without cryp- 
tography would be to use a Byzantine-resilient broadcast [3, 6, 9, 18]: each node 
advertises its neighborhood. However, all existing solutions for arbitrary topology
known to us require that the graph topology is a priori known to the nodes.

Let us survey the non-cryptography based approaches to Byzantine fault
tolerance. Most programs described in the literature [1, 13, 12, 16] assume completely
connected networks and cannot be easily extended to deal with arbitrary
topology. Dolev [6] considers Byzantine agreement on arbitrary graphs. He states
that for agreement in the presence of up to k Byzantine nodes, it is necessary
and sufficient that the network is (2k + 1)-connected and the number of nodes in
the system is at least 3k + 1. However, his solution requires that the nodes are
aware of the topology in advance. Also, this solution assumes the synchronous
execution model. Recently, the problem of Byzantine-robust reliable broadcast
has attracted attention [3, 9, 18]. However, in all cases the topology is assumed
to be known. Bhandari and Vaidya [3] and Koo [9] assume a two-dimensional grid.
Pelc and Peleg [18] consider arbitrary topology but assume that each node knows
the exact topology a priori. A notable class of algorithms tolerates Byzantine
faults locally [15, 17, 20]. Yet, the emphasis of these algorithms is on containing
the fault as close to its source as possible. This is only applicable to the problems
where the information from remote nodes is unimportant such as vertex
coloring, link coloring or dining philosophers. Thus, the local containment approach
is not applicable to topology discovery.

Masuzawa [14] considers the problem of topology discovery and update. How- 
ever, Masuzawa is interested in designing a self-stabilizing solution to the prob- 
lem and thus his fault model is not as general as Byzantine: he considers only 
transient and crash faults. 

The rest of the paper is organized as follows. After stating our programming 
model and notation in Section 2, we formulate the topology discovery problems, 
as well as state the impossibility results in Section 3. We present Detector and 
Explorer in Sections 4 and 5 respectively. We discuss the composition of our
programs and their extensions in Section 6 and conclude the paper in Section 7. 

2 Notation, Definitions and Assumptions 

Graphs. A distributed system (or program) consists of a set of processes and 
a neighbor relation between them. This relation is the system topology. The 
topology forms a graph G. Denote n and e to be the number of nodes^1 and edges
in G respectively. Two processes are neighbors if there is an edge in G connecting
them. The set P of neighbors of process p is the neighborhood of p. In the sequel we use
small letters to denote singleton variables and capital letters to denote sets. In
particular, we use a small letter for a process and a matching capital one for this
process' neighborhood. Since the topology is symmetric, if q ∈ P then p ∈ Q.
Denote δ to be the maximum number of nodes in a neighborhood.

A node-cut of a graph is a set of nodes U such that G \ U is disconnected
or trivial. The node-connectivity (or just connectivity) of a graph is the minimum
cardinality of a node-cut of this graph. In this paper we make use of the following
fact about graph connectivity that follows from Menger's theorem (see [22]): if
a graph is k-connected (where k is some constant) then for every two vertices u
and v there exist at least k internally node-disjoint paths connecting u and v
in this graph.
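The definition of connectivity can be checked directly on small graphs. The following is a minimal brute-force sketch, not part of the algorithms of this paper: it tries node sets of growing size until the remaining graph is disconnected or trivial. The function names and the adjacency-dictionary representation are assumptions made for illustration, and the search is exponential, so it is only meant for toy examples.

```python
from itertools import combinations

def connected(adj, nodes):
    """True if the subgraph induced by the non-empty set `nodes` is connected."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes

def node_connectivity(adj):
    """Minimum size of a node-cut U: G minus U is disconnected or trivial."""
    all_nodes = set(adj)
    for size in range(len(all_nodes)):
        for cut in combinations(sorted(all_nodes), size):
            rest = all_nodes - set(cut)
            if len(rest) <= 1 or not connected(adj, rest):
                return size
    return 0  # only reached for the empty graph

# Ring of five nodes (hypothetical example graph).
ring = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
conn = node_connectivity(ring)
max_k_weak = conn - 1            # weak version needs connectivity > k
max_k_strong = (conn - 1) // 2   # strong version needs connectivity > 2k
```

Under the bounds stated in this paper, the two derived values give the largest number of faults for which the weak and the strong problem can be solvable on the example graph.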

Program model. A process contains a set of variables. When it is clear from
the context, we refer to a variable var of process p as var.p. Every variable ranges
over a fixed domain of values. For each variable, certain values are initial. Each 
pair of neighbor processes share a pair of special variables called channels. We 
denote Ch.b.c the channel from process b to process c. Process b is the sender 
and c is the receiver. The value for a channel variable is chosen from the domain 
of (potentially infinite) sequences of messages. 

A state of the program is the assignment of a value to every variable of each 
process from its corresponding domain. A state is initial if every variable has 
initial value. Each process contains a set of actions. An action has the form 
⟨name⟩ : ⟨guard⟩ → ⟨command⟩. A guard is a boolean predicate over the variables
of the process. A command is a sequence of assignment and branching statements.
A guard may be a receive-statement that accesses the incoming channel.
A command may contain a send-statement that modifies the outgoing channel.
A parameter is used to define a set of actions as one parameterized action. For
example, let j be a parameter ranging over values 2, 5 and 9; then a parameterized
action ac.j defines the set of actions ac.(j = 2) [] ac.(j = 5) [] ac.(j = 9).
Either guard or command can contain quantified constructs [5] of the form:
(⟨quantifier⟩ ⟨bound variables⟩ : ⟨range⟩ : ⟨term⟩), where range and term are
boolean constructs.

Semantics. An action of a process of the program is enabled in a certain state
if its guard evaluates to true. An action containing a receive-statement is enabled
when an appropriate message is at the head of the incoming channel. The execution
of the command of an action updates variables of the process. The execution of
an action containing a receive-statement removes the received message from the
head of the incoming channel and inserts the value the message contains into
the specified variables. The execution of a send-statement appends the specified
message to the tail of the outgoing channel.

^1 We use the terms process and node interchangeably.

A computation of the program is a maximal fair sequence of states of the
program such that the first state s_0 is initial and for each state s_i the state s_{i+1}
is obtained by executing the command of an action that is enabled in
s_i. That is, we assume that the action execution is atomic. The maximality of
a computation means that the computation is either infinite or it terminates
in a state where none of the actions are enabled. The fairness means that if
an action is enabled in all but finitely many states of an infinite computation
then this action is executed infinitely often. That is, we assume weak fairness
of action execution. Notice that we define the receive statement to appear as a
standalone guard of an action. This means that if a message of the appropriate
type is at the head of the incoming channel, the receive action is enabled. Due
to the weak fairness assumption, this leads to the fair message receipt assumption: each
message in the channel is eventually received. Observe that our definition of a
computation considers asynchronous computations.
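The execution semantics above can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not the formal model: actions are guard/command pairs, a channel is a FIFO queue, and a round-robin scheduler is one simple way to obtain weak fairness, since any action that stays enabled is reached within one pass. The names `Action`, `run`, `state` and `ch_b_c` are hypothetical.

```python
from collections import deque

class Action:
    """A guarded action: `command` may run only when `guard()` is true."""
    def __init__(self, name, guard, command):
        self.name, self.guard, self.command = name, guard, command

def run(actions, max_steps=1000):
    """Execute enabled actions round-robin until none is enabled."""
    trace, steps = [], 0
    while steps < max_steps:
        progressed = False
        for act in actions:
            if act.guard():
                act.command()          # executed atomically
                trace.append(act.name)
                steps += 1
                progressed = True
        if not progressed:             # maximal computation: nothing enabled
            break
    return trace

# Two processes b and c connected by the channel Ch.b.c.
state = {"start": True, "got": []}
ch_b_c = deque()

def send_cmd():
    state["start"] = False
    ch_b_c.append("hello")             # send appends to the tail
    ch_b_c.append("world")

def recv_cmd():
    state["got"].append(ch_b_c.popleft())   # receive removes the head

actions = [
    Action("b.init", lambda: state["start"], send_cmd),
    Action("c.receive", lambda: len(ch_b_c) > 0, recv_cmd),
]
trace = run(actions)
```

The receive action is a standalone guard on a non-empty channel, so under this scheduler every message placed in the channel is eventually consumed, which mirrors the fair message receipt assumption.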

To reason about program behavior we define boolean predicates on program 
states. A program invariant is a predicate that is true in every initial state of the 
program and if the predicate holds before the execution of the program action, 
it also holds afterwards. Notice that by this definition a program invariant holds 
in each state of every program computation. 

Faults. Throughout a computation, a process may be either Byzantine (faulty)
or non-faulty. A Byzantine process contains an action that assigns to each local 
variable an arbitrary value from its domain. This action is always enabled. Yet, 
the weak fairness assumption does not apply to this action. That is, we consider 
computations where a faulty process does not execute any actions. Observe that 
we allow a faulty node to send arbitrary messages. We assume, however, that 
messages sent by such a node conform to the format specified by the algorithm: 
each message carries the specified number of values, and the values are drawn 
from appropriate domains. This assumption is not difficult to implement as mes- 
sage syntax checking logic can be incorporated in receive-action of each process. 
We assume oral record [11] of message transmission: the receiver can always cor- 
rectly identify the message sender. The channels are reliable: the messages are 
delivered in FIFO order and without loss or corruption. Throughout the paper 
we assume that the maximum number of faulty nodes in the system is bounded 
by some constant k. 
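The syntax check mentioned above could look roughly as follows for messages that carry a process identifier and a neighborhood, the format used by the programs later in the paper. This is only a sketch: the identifier domain (non-negative integers) and the names `well_formed` and `is_identifier` are assumptions for illustration, since the paper does not fix a concrete domain.

```python
def is_identifier(x):
    """Assumed identifier domain: non-negative integers (bool excluded)."""
    return isinstance(x, int) and not isinstance(x, bool) and x >= 0

def well_formed(msg, is_id=is_identifier):
    """True if msg is a pair (r, R) with r an identifier and R a set of identifiers."""
    if not (isinstance(msg, tuple) and len(msg) == 2):
        return False
    r, R = msg
    if not is_id(r):
        return False
    if not isinstance(R, (set, frozenset)):
        return False
    return all(is_id(x) for x in R)
```

A receive action would simply discard any message for which the check fails, so that the rest of the program only ever sees values from the expected domains.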

Graph exploration. The processes discover the topology of the system by 
exchanging messages. Each message contains the identifier of the process and 
its neighborhood. Process p explored process q if p received a message with
(q, Q). When it is clear from the context, we omit the mention of p. An explored
subgraph of a graph contains only explored processes. A Byzantine process may 
potentially circulate information about the processes that do not exist in the 
system altogether. A process is fake if it does not exist in the system, a process 
is real otherwise. 

3 The Topology Discovery Problem: Statement and 
Solution Bounds 

Problem statement. 

Definition 1 (Weak Topology Discovery Problem). A program is a solution
to the weak topology discovery problem if each of the program's computations
satisfies the following properties: termination: either all non-faulty
processes determine the system topology or at least one process detects a fault;
safety: for each non-faulty process, the determined topology is a subset of the
actual system topology; validity: the fault is detected only if there are faulty
processes in the system.

Definition 2 (Strong Topology Discovery Problem). A program is a solution
to the strong topology discovery problem if each of the program's computations
satisfies the following properties: termination: all non-faulty processes
determine the system topology; safety: the determined topology is a subset of
the actual system topology.

According to the safety property of both problem definitions each non-faulty
process is only required to discover a subset of the actual system topology. However,
the desired objective is for each node to discover as much of it as possible.
The following definitions capture this idea. A solution to a topology discovery
problem is complete if every non-faulty process always discovers the complete
topology of the system. A solution to the problem is node-complete if every
non-faulty process discovers all nodes of the system. A solution is adjacent-edge
complete if every non-faulty node discovers each edge adjacent to at least one
non-faulty node. A solution is two-adjacent-edge complete if every non-faulty
node discovers each edge adjacent to two non-faulty nodes.
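To make the difference between these notions concrete, the following sketch checks them for the topology discovered by a single process. It is an illustration only: the definitions above quantify over all non-faulty processes and all computations, whereas the helper functions here (hypothetical names, edges represented as two-element frozensets) test one discovered edge set against the actual one.

```python
def edge(u, v):
    """Undirected edge as an unordered pair."""
    return frozenset((u, v))

def is_safe(discovered, actual):
    """Safety: the discovered topology is a subset of the actual one."""
    return discovered <= actual

def is_adjacent_edge_complete(discovered, actual, faulty):
    """Every actual edge with at least one non-faulty endpoint is discovered."""
    return all(e in discovered
               for e in actual if any(x not in faulty for x in e))

def is_two_adjacent_edge_complete(discovered, actual, faulty):
    """Every actual edge with two non-faulty endpoints is discovered."""
    return all(e in discovered
               for e in actual if all(x not in faulty for x in e))

# Hypothetical four-node system; the edge 2-4 is missing from the discovered set.
actual = {edge(1, 2), edge(2, 3), edge(3, 4), edge(4, 1), edge(2, 4)}
discovered = actual - {edge(2, 4)}
```

With nodes 2 and 4 both faulty, missing the edge between them is still adjacent-edge complete; with only node 2 faulty it is two-adjacent-edge complete but no longer adjacent-edge complete.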

Solution bounds. To simplify the presentation of the negative results in this 
section we assume more restrictive execution semantics. Each channel contains at 
most one message. The computation is synchronous and proceeds in rounds. In 
a single round, each process consumes all messages in its incoming channels and 
outputs its own messages into the outgoing channels. Notice that the negative 
results established for this semantics apply for the more general semantics used 
in the rest of the paper. 

Theorem 1. There does not exist a complete solution to the weak topology 
discovery problem. 
Proof: Assume there exists a complete solution to the problem. Consider
k ≥ 2 and topology G_1 that is not completely connected. Let none of the nodes
in G_1 be faulty. By the validity property, none of the nodes may detect a fault
in such topology. Consider a computation s_1 of the solution program where each
node discovers G_1. Let p ∈ G_1, q ≠ p, and r ≠ p be three nodes in G_1, with q
and r being non-neighbor nodes in G_1. Since G_1 is not completely connected we
can always find two such nodes.

We form topology G_2 by connecting q and r in G_1. Let q and r be faulty
in G_2. We construct a computation s_2 which is identical to s_1. That is, q and
r, being faulty, in every round output the same messages as in s_1. Since s_2 is
otherwise identical to s_1, process p determines that the topology of the system
is G_1 ≠ G_2. Thus, the assumed solution is not complete. □

Theorem 2. There exists no node- and adjacent-edge complete solution to the
weak topology problem if the connectivity of the graph is less than or equal to the
total number of faults k.

Proof: Assume the opposite. Let there be a node- and adjacent-edge complete
program that solves the problem for graphs whose connectivity is k or less. Let
G_1 and G_2 be two graphs of connectivity k.

This means that G_1 and G_2 contain the respective cut node sets A_1 and
A_2 whose cardinality is k. Rename the processes in G_2 such that A_1 = A_2. By
definition A_1 separates G_1 into two disconnected sets B_1 and C_1. Similarly, A_2
separates G_2 into B_2 and C_2. Assume that B_1 ≠ B_2. Since A_1 = A_2 we can form
graph G_3 as A_1 ∪ B_2 ∪ C_1.

Let s_1 be any computation of the assumed program in the system of topology
G_1 and no faulty nodes. Since the program solves the weak topology problem, the
computation has to comply with all the properties of the problem. By the validity
property, no fault is detected in s_1. By the termination property, each node in G_1,
including some node p ∈ C_1, eventually discovers the system topology.

By the safety property the topology discovered by p is a subset of G_1. Since
the solution is complete the discovered topology is G_1 exactly. Let s_2 be any
computation of the assumed program in the system of topology G_2 and no
faulty nodes. Again, none of the nodes detects a fault and all of them discover
the complete topology of G_2 in s_2.

We construct a new computation s_3 of the assumed program as follows. The
system topology for s_3 is G_3 where all nodes in A_1 are faulty. Each faulty node
q ∈ A_1 behaves as follows. In the channels connecting q to the nodes of C_1 ⊂ G_3,
each round q outputs the messages as in s_1. Similarly, in the channels connecting
q to the nodes of B_2 ⊂ G_3, q outputs the messages as in s_2. The non-faulty nodes
of B_2 and C_1 behave as in s_2 and s_1 respectively.

Observe that for the nodes of B_2, the topology and communication is
indistinguishable from that of s_2. Similarly, for the nodes of C_1 the topology and
communication is indistinguishable from that of s_1. Notice that this means that
none of the non-faulty nodes detect a fault in the system. Moreover, node p ∈ C_1
decides that the system topology is the subset of G_1. Yet, by construction,
G_1 ≠ G_3. Specifically, B_1 ≠ B_2. Moreover, none of the nodes in B_2 are faulty. If
this is the case then either s_3 violates the safety property of the problem or the
assumed solution is not adjacent-edge complete. The theorem follows. □
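The graph surgery used in this proof can be replayed on a toy instance. The sketch below builds two small graphs that share a one-node cut (so k = 1), forms the third graph from the cut, one side of the second graph and the other side of the first, and checks that the nodes on each kept side see exactly the neighborhoods they had in their original graph. The node names and helper functions are hypothetical. Only local neighborhoods are compared here; the message-level indistinguishability in the proof additionally relies on the faulty cut nodes replaying the messages of the two fault-free computations.

```python
def graph(edges):
    """Adjacency dictionary of an undirected graph given as a list of pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def induced(adj, keep):
    """Subgraph induced by the node set `keep`."""
    return {u: adj[u] & keep for u in adj if u in keep}

def union(g, h):
    """Union of two graphs over possibly overlapping node sets."""
    out = {}
    for part in (g, h):
        for u, nbrs in part.items():
            out.setdefault(u, set()).update(nbrs)
    return out

A = {"a"}                                   # common cut, all faulty in s_3
B1, C1 = {"b1"}, {"c1", "c2"}
B2, C2 = {"x1", "x2"}, {"y"}
G1 = graph([("a", "b1"), ("a", "c1"), ("a", "c2"), ("c1", "c2")])
G2 = graph([("a", "x1"), ("a", "x2"), ("x1", "x2"), ("a", "y")])

# G_3 = A_1 ∪ B_2 ∪ C_1
G3 = union(induced(G1, A | C1), induced(G2, A | B2))

same_view_C1 = all(G3[u] == G1[u] for u in C1)
same_view_B2 = all(G3[u] == G2[u] for u in B2)
```

Both flags hold while the combined graph differs from the first one, which is the situation the proof exploits.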

Observe that for (k + 1)-connected graphs an adjacent-edge complete solution
is also node-complete.

Theorem 3. There does not exist an adjacent-edge complete solution to the
strong topology discovery problem.

Proof: Assume such a solution exists. Consider system graph G_1 that is not
completely connected. Let p ∈ G_1 be an arbitrary node. Let q ≠ p and r ≠ p be
two non-neighbor nodes of G_1. We form topology G_2 by connecting q and r in
G_1.

We construct computations s_1 and s_2 as follows. Let s_1 and s_2 be executed
on G_1 and G_2 respectively. Let q be faulty in s_1 and r be faulty in s_2. Set
the output of q in each round to be identical in s_1 and s_2. Similarly, set the
output of r to be identical in both computations as well. Since the output of q
and r in both computations is identical, we construct the behavior of the rest of
the nodes in s_1 and s_2 to be the same.

Due to the termination property, p has to decide on the system topology in both
computations. Due to the safety property, in s_1 process p has to determine that
the topology of the graph is a subset of G_1. However, since the behavior of p in
s_2 is identical to that in s_1, p decides that the topology of the system graph is
G_1 in s_2 as well. This means p does not include the edge between q and r in the
explored topology in s_2. Yet, one of the nodes adjacent to this edge, namely q,
is not faulty. An adjacent-edge complete program should include such edges in
the discovered topology. Therefore, the assumed program is not adjacent-edge
complete. □

Theorem 4. There exists no node- and two-adjacent-edge complete solution to
the strong topology problem if the connectivity of the graph is less than or equal
to twice the total number of faults k.

Proof: Assume that there is a program that solves the problem for graphs
whose connectivity is 2k or less. Let G_1 and G_2 be two different graphs whose
connectivity is 2k. Similar to the proof of Theorem 2, we assume that G_1 =
A_1 ∪ B_1 ∪ C_1 and G_2 = A_2 ∪ B_2 ∪ C_2 where the cardinality of A_1 and A_2 is 2k,
A_1 = A_2, B_1 ∩ C_1 = ∅, B_2 ∩ C_2 = ∅, and B_1 ≠ B_2. Form G_3 = A_1 ∪ B_2 ∪ C_1.
Divide A_1 into two subsets A_1' and A_1'' of the same number of nodes.

Construct a computation s_1 with system topology G_1 where all nodes in
A_1' are faulty; and another computation s_3 with system topology G_3 where all
nodes in A_1'' are faulty. The faulty nodes in s_1 in the channels connecting A_1'
to C_1 communicate as the (non-faulty) nodes of A_1' in s_3. Similarly, the faulty
nodes in s_3 in the channels connecting A_1'' to C_1 communicate as the nodes
of A_1'' in s_1. Observe that s_1 and s_3 are indistinguishable to the nodes in C_1.
Let the nodes in C_1, including p ∈ C_1, behave identically in both computations.
According to the termination property of the strong topology discovery problem
every node, including p, has to determine the system topology in both s_1 and s_3.
Due to safety, the topology that p determines in s_1 is a subset of G_1. However,
p behaves identically in s_3.

This means that p decides that the system topology in s_3 is also a subset of
G_1. Since G_1 ≠ G_3 (specifically, B_1 ≠ B_2), and none of the nodes in B_2 are
faulty, this implies that either s_3 violates the safety property of the problem or
the assumed solution is not adjacent-edge complete. The theorem follows. □

4 Detector 

Outline. Detector solves the weak topology discovery problem for system graphs 
whose connectivity exceeds the number of faulty nodes k. The algorithm lever- 
ages the connectivity of the graph. For each pair of nodes, the graph guarantees 
the presence of at least one path that does not include a faulty node. The topol- 
ogy data travels along every path of the graph. Hence, the process that collects 
information about another process can find the potential inconsistency between 
the information that proceeds along the path containing faulty nodes and the 
path containing only non-faulty ones. 

Care is taken to detect the fake nodes whose information is introduced by 
faulty processes. Since the processes do not know the size of the system, a faulty 
process may potentially introduce an infinite number of fake nodes. However, the 
graph connectivity assumption is used to detect fake nodes. As faulty processes 
are the only source of information about fake nodes, all the paths from the real 
nodes to the fake ones have to contain a faulty node. Yet, the graph connectivity 
is assumed to be greater than k. If a fake node is ever introduced, one of the 
non-faulty processes eventually detects a graph with too few paths leading to 
the fake node. 

Detailed Description. The program is shown in Figure 1. Each process p stores 
the identifiers of its immediate neighbors. They are kept in set P. Each process 
keeps the upper bound k on the number of faulty processes. Process p maintains 
the following variables. Boolean variable detect indicates if p discovers a fault 
in the system. Boolean variable start guards the execution of the action that 
sends p's neighborhood information to its neighbors. Set TOP (for topology) 
stores the subgraph explored by p: TOP contains tuples of the form (process
identifier, its neighborhood). In the initial state, TOP contains (p, P).

Function path_number evaluates the topology of the subgraph stored in 
TOP. A node u is unexplored by p if for every tuple (s, S) ∈ TOP,
s is not the same as u. That is, u may appear in S only. We construct graph G'
by adding an edge between every pair of unexplored processes present in TOP. We
calculate the value of path_number as follows. If the information of TOP is 
inconsistent, that is: 

(∃u, v, U, V : ((u, U) ∈ TOP) ∧ ((v, V) ∈ TOP) :
    (u ∈ V) ∧ (v ∉ U))
process p
const
    P: set of neighbor identifiers of p
    k: integer, upper bound on the number of faulty processes
parameter
    q: P
var
    detect: boolean, initially false, signals fault
    start: boolean, initially true, controls sending of p's neighborhood info
    TOP: set of tuples, initially {(p, P)}, (process ids, neighbor id set)
         received by p

*[
    init: start →
        start := false,
        (∀j : j ∈ P : send (p, P) to j)
[]
    accept: receive (r, R) from q →
        if (∃s, S : (s, S) ∈ TOP : s = r ∧ S ≠ R) ∨
           (path_number(TOP ∪ {(r, R)}) < k + 1)
        then
            detect := true
        else
            if (∄s, S : (s, S) ∈ TOP : s = r) then
                TOP := TOP ∪ {(r, R)},
                (∀j : j ∈ P : send (r, R) to j)
]

Fig. 1. Process of Detector




then path_number returns 0. If there is exactly one explored node in TOP,
path_number returns k + 1. Otherwise the function returns the minimum number
of internally node-disjoint paths between two explored nodes in G'. In the
correctness proof for this program we show that unless there is a fake node, the
path_number of G' is no smaller than the connectivity of G.
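One possible way to compute path_number is sketched below. The paper does not prescribe an implementation; this version counts internally node-disjoint paths with a unit-capacity maximum flow on a node-split graph, which is a standard technique chosen here as an assumption. TOP is taken to be a non-empty collection of (identifier, neighborhood) pairs, and the helper name `disjoint_paths` is hypothetical.

```python
from collections import deque
from itertools import combinations

def disjoint_paths(adj, s, t):
    """Number of internally node-disjoint s-t paths (unit-capacity max flow)."""
    cap, nbrs = {}, {}
    def arc(a, b):
        cap[(a, b)] = 1
        cap.setdefault((b, a), 0)          # residual arc
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    for u in adj:
        arc((u, "in"), (u, "out"))         # each node may carry one path
        for v in adj[u]:
            arc((u, "out"), (v, "in"))
    source, sink, flow = (s, "out"), (t, "in"), 0
    while True:
        parent, queue = {source: None}, deque([source])
        while queue and sink not in parent:      # BFS for an augmenting path
            a = queue.popleft()
            for b in nbrs.get(a, ()):
                if b not in parent and cap[(a, b)] > 0:
                    parent[b] = a
                    queue.append(b)
        if sink not in parent:
            return flow
        b = sink
        while parent[b] is not None:             # push one unit of flow
            a = parent[b]
            cap[(a, b)] -= 1
            cap[(b, a)] += 1
            b = a
        flow += 1

def path_number(top, k):
    top = [(s, frozenset(S)) for s, S in top]
    for u, U in top:                             # inconsistency: u in V but v not in U
        for v, V in top:
            if u in V and v not in U:
                return 0
    explored = {s for s, _ in top}
    if len(explored) == 1:
        return k + 1
    adj = {s: set() for s in explored}
    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    for s, S in top:
        for x in S:
            link(s, x)
    unexplored = set(adj) - explored
    for a, b in combinations(unexplored, 2):     # build G'
        link(a, b)
    return min(disjoint_paths(adj, a, b) for a, b in combinations(explored, 2))
```

For example, on a four-node ring with k = 1, a process that has explored itself and one neighbor obtains two disjoint paths once the two unexplored nodes are joined in G', whereas a node that is reachable only through a single claimed neighbor yields a value of 1 and would trigger detection.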

Processes exchange messages of the form (process identifier, its neighborhood
id set). A process contains two actions: init and accept. Action init starts the
propagation of p's neighborhood throughout the system. Action accept receives
the neighborhood data of some process, records it, checks against other data
already available for p and possibly further disseminates the data. If the data
received from neighbor q about a process r contradicts what p already holds
about r in TOP or if the newly arrived information implies that G is less than
(k + 1)-connected, p indicates that it detected a fault by setting detect to true.
Alternatively, if p did not previously have the information about r, p updates
TOP and sends the received information to all its neighbors.
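The two actions can be transcribed almost line by line. The sketch below is an illustration under stated assumptions rather than the authors' implementation: path_number is passed in as a callable, and the demonstration uses a stub that always returns k + 1, which disables the connectivity check and exercises only the conflict test and the flooding. The class name, the `simulate` driver and the per-channel FIFO queues are hypothetical; the sender q is not needed by the logic and is omitted.

```python
from collections import deque

class DetectorProcess:
    def __init__(self, pid, neighbors, k, path_number):
        self.pid, self.P, self.k = pid, frozenset(neighbors), k
        self.path_number = path_number       # assumed callable: (TOP, k) -> int
        self.detect = False
        self.start = True
        self.TOP = {(pid, self.P)}

    def init(self):
        """Action init: announce own neighborhood once."""
        if not self.start:
            return []
        self.start = False
        return [(j, (self.pid, self.P)) for j in self.P]

    def accept(self, msg):
        """Action accept: returns the list of (destination, message) to send."""
        r, R = msg[0], frozenset(msg[1])
        conflict = any(s == r and S != R for s, S in self.TOP)
        if conflict or self.path_number(self.TOP | {(r, R)}, self.k) < self.k + 1:
            self.detect = True
            return []
        if all(s != r for s, _ in self.TOP):  # first information about r
            self.TOP.add((r, R))
            return [(j, (r, R)) for j in self.P]
        return []

def simulate(adj, k, path_number):
    """Run all processes to quiescence over reliable FIFO channels."""
    procs = {p: DetectorProcess(p, adj[p], k, path_number) for p in adj}
    channels = {(p, q): deque() for p in adj for q in adj[p]}
    for p, proc in procs.items():
        for dest, msg in proc.init():
            channels[(p, dest)].append(msg)
    progress = True
    while progress:
        progress = False
        for (src, dst), ch in channels.items():
            if ch:
                for dest, msg in procs[dst].accept(ch.popleft()):
                    channels[(dst, dest)].append(msg)
                progress = True
    return procs

stub = lambda top, k: k + 1                   # stub: connectivity check disabled
ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
procs = simulate(ring, 1, stub)
```

In the fault-free ring every process ends with all four neighborhoods in TOP and detect still false; feeding a single process two different neighborhoods for the same identifier sets detect to true.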
Observe that the propagation of information about the neighborhood of a
certain process is independent of the information propagation of another process. 
Thus, we will focus on the propagation of the information about a particular 
non-faulty process a. 

Let COR contain each process b such that b is not faulty and TOP.b holds
(a, A). Let a itself belong to COR if start.a is false.

Lemma 1. The following predicate is an invariant of Detector.

(∀ non-faulty b, c : b ∈ COR, c ∈ B :
    (c ∈ COR) ∨
    ((a, A) ∈ Ch.b.c)) ∨                                        (1)
(∃ non-faulty j : j ∈ N : detect.j = true)

The predicate states that unless one of the non-faulty processes in the program
detects a fault, if a process b belongs to COR then each neighbor c of b
either belongs to COR as well or the channel from b to c contains (a, A).



Proof: To prove that Predicate 1 is an invariant of the program, we need
to show that it holds in the initial state of any computation and it is closed
under the execution of actions of Byzantine as well as non-faulty processes. The
predicate holds initially as the first disjunct is vacuously true.

Note that no action of a Byzantine process immediately affects the validity
of the predicate. Observe also that a non-faulty process can only set detect to
true. Thus, once this happens the predicate holds throughout the rest of the
computation. Suppose detect is false in all processes of the program. Then the
predicate is violated only if there is a non-faulty pair of neighbors b and c such
that b belongs to COR, c does not and there is no message (a, A) in the channel
from b to c. Notice that a non-faulty process adds the first value (r, R) to TOP
and never changes it afterwards. Thus, provided that detect = false, to violate
the predicate, a process has to join COR without sending (a, A) to its neighbors
or consume a message with (a, A) without joining COR. Let us examine the
actions of a non-faulty process and ensure that neither of these happens.

Observe that init is only of interest in a. This action sets start.a = false
which, by definition, adds a to COR. Also, init atomically sends (a, A) to all
neighbors of a. Thus, the predicate is not violated by the execution of init.

Let us now consider accept in an arbitrary non-faulty process u. Let the
message received by u carry (r, R). Observe that accept affects Predicate 1 only
if r = a. Then accept may make u join COR or consume a message with (a, A). Notice
that if u is already in COR the receipt of a message with (a, A) does not violate
the predicate. Also, u joins COR only if it receives (a, A). Hence, the only case
we have to consider is when u does not belong to COR before the execution of
accept, u receives (a, A) and joins COR.

The behavior of u in this case depends on whether it has an element (s, S)
in TOP.u such that s = a. Since u ∉ COR, if (a, S) ∈ TOP.u, then S differs
from A. In this case if u receives (a, A) then it sets detect = true. This preserves
the validity of the predicate. Alternatively, if such an entry in TOP.u does not
exist, then the receipt of (a, A) causes u to join COR and forward (a, A) to all
its neighbors. This preserves the predicate as well.

Thus, Predicate 1 holds in the initial state of every computation of the
program and is preserved by its every action. This means that the predicate is
an invariant of the program. □
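The case analysis in this proof can be illustrated with a small sketch. The Python below is a hypothetical model rather than the Detector program itself: the class name DetectorProcess, the outbox list and the dictionary representation of TOP are assumptions made for illustration, and the path_number check is left out. It only shows the three outcomes of accept discussed above: a first value is recorded and forwarded, an identical value is ignored, and a divergent value sets detect.

```python
from dataclasses import dataclass, field


@dataclass
class DetectorProcess:
    """Hypothetical model of one non-faulty Detector process."""
    pid: str
    neighbors: frozenset
    top: dict = field(default_factory=dict)      # TOP: process id -> neighbor set
    detect: bool = False
    outbox: list = field(default_factory=list)   # (destination, message) pairs

    def accept(self, r, R):
        """Handle a received tuple (r, R)."""
        R = frozenset(R)
        if r in self.top:
            if self.top[r] != R:
                self.detect = True               # divergent neighborhood info for r
            return                               # identical info: nothing changes
        self.top[r] = R                          # first value is kept unchanged
        for j in sorted(self.neighbors):         # forward to every neighbor
            self.outbox.append((j, (r, R)))
```

In this model, a process is in COR exactly when top contains the correct entry for a, so joining COR and forwarding (a, A) happen in the same call, mirroring the atomicity the proof relies on.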

Lemma 2. If a computation of Detector contains a state where there is a
process u that belongs to COR that has a non-faulty neighbor v that does not, then
further in the computation, either some non-faulty process sets detect = true
or v joins COR.

Proof: According to Lemma 1, Predicate 1 is an invariant of the program.
Hence, if u belongs to COR and its non-faulty neighbor v does not, then channel
Ch.u.v contains a message with (a, A). Due to the fair message receipt assumption,
(a, A) is eventually received. Observe that if v is not in COR and it receives
(a, A), then either v sets detect = true or joins COR. □

Lemma 3. Every computation of Detector contains a state where either detect = 
true in some non-faulty process or every non-faulty process belongs to COR. 

Proof: The proof is by induction on the number of non-faulty processes in
the program. As a base case, we show that a itself eventually joins COR. Recall
that we assume that a itself is not faulty. Observe that the program starts in a
state where start.a is true. If this is so, init is enabled. Moreover, init is the
only action that sets start.a to false. Thus, init stays enabled until executed.
By the weak fairness assumption, init is eventually executed. When this happens, a
joins COR.

Assume that COR contains i processes, 1 ≤ i < n, at some state of a
computation and there is a non-faulty process that does not belong to COR. We
assume that the connectivity of the graph exceeds the maximum number of
faulty processes. Thus, there is a non-faulty process u ∈ COR that has a
non-faulty neighbor v ∉ COR. According to Lemma 2, this computation contains
a state where either some non-faulty process sets detect = true or COR contains
v. Thus, unless a fault is detected, every non-faulty process eventually joins
COR. □

Lemma 4. If a computation of Detector contains a state where non-faulty pro- 
cess u explores a fake process v, then this computation contains a state where 
detect = true in some non-faulty process. 

Proof: Observe that the only source of fake process information is a Byzan- 
tine process. Hence, if u explores a fake process v, then every path to v leads 
through a Byzantine process. Thus, in a graph with a fake node, the maximum 
number of node-disjoint paths between a real and a fake node is no more than 
k. 

According to Lemma 3, eventually, either detect = true at a non-faulty
process or u explores every non-faulty process in the system. In this case u
detects that all paths to the fake node v lead through no more than k processes
and sets detect = true. □



Lemma 5. If the system does not have a faulty process, then in every computa- 
tion, for each process, the path_number of the explored subgraph G' is greater 
than k. 

Proof: Observe that if there are no faulty processes, only correct topology
information is circulated in the system. Hence, for each process u, TOP.u
contains a subgraph of the system graph G. In this case, G'.u is an arbitrary set of
explored processes from G and the unexplored members of their neighborhoods.
By the construction of G'.u, every pair of unexplored processes is connected by
an edge.




Fig. 2. Illustration for the proof of Lemma 5: construction of path P' ⊂ G' on the
basis of path P ⊂ G

Let v and w be an arbitrary pair of explored nodes in G'.u, and let P be
a path connecting v and w in G. We claim that there exists a path P' in G'.u
connecting v and w that is also a node-subset of P. That is, every node that
belongs to P' also belongs to P. See Figure 2 for the illustration. If P contains
only the nodes explored in G'.u, our claim holds since P' = P. Let P contain
unexplored nodes as well. In general, P contains alternating segments of explored
and unexplored nodes. Let (x_i, y_i, ..., y_{i+1}, x_{i+1}) be any such unexplored
segment, where x_i, x_{i+1} are explored and y_i, ..., y_{i+1} are not. Observe that
y_i and y_{i+1} have explored neighbors, x_i and x_{i+1} respectively. This means
that both y_i and y_{i+1} belong to G'.u. Since y_i and y_{i+1} are unexplored,
G'.u contains an edge connecting them. We construct P' to contain every explored
segment of P; we replace every unexplored segment by the edge that links
unexplored nodes in G'.u. Observe that by construction, P' belongs to G'.u and
P' contains a subset of the nodes of P. Thus, our claim holds.

Let P_1 and P_2 be two internally node disjoint paths connecting v and w in G.
According to the just proved claim, there exist P_1' and P_2' belonging to G'.u
that connect v and w. Moreover, P_1' contains a subset of nodes of P_1 and P_2'
contains a subset of nodes of P_2. Since P_1 and P_2 are internally node disjoint,
so are P_1' and P_2'.

Recall that G is assumed to be (k + 1)-connected. This means that for every
two vertices v and w there exist k + 1 internally node disjoint paths between v
and w. Thus, the number of internally node disjoint paths for v and w in G'.u
is at least k + 1. Hence, the path_number of G'.u is greater than k. □
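To make the quantity used in this proof concrete, the following Python sketch counts internally node disjoint paths between two nodes by a standard reduction to unit-capacity maximum flow with node splitting. This is an illustrative technique chosen here, not a procedure given in the paper. The function names node_disjoint_paths and path_number are hypothetical, the graph is assumed to be an undirected adjacency map in which every node appears as a key, and path_number is assumed to be the minimum of the count over all node pairs.

```python
from collections import defaultdict, deque
from itertools import combinations


def node_disjoint_paths(adj, s, t):
    """Count internally node disjoint s-t paths in an undirected graph.

    adj maps every node to the set of its neighbors (assumed symmetric).
    """
    cap = defaultdict(int)      # residual capacity of each directed arc
    out = defaultdict(set)      # residual adjacency, forward and reverse

    def arc(u, v, c):
        cap[(u, v)] += c
        out[u].add(v)
        out[v].add(u)           # reverse arc, initially with capacity 0

    for v in adj:
        # Each node is split into (v, "in") -> (v, "out") with capacity 1,
        # so an intermediate node can carry at most one path.
        arc((v, "in"), (v, "out"), len(adj) if v in (s, t) else 1)
    for u in adj:
        for v in adj[u]:
            arc((u, "out"), (v, "in"), 1)

    source, sink = (s, "in"), (t, "out")
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in out[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        v = sink
        while parent[v] is not None:    # push one unit along the path
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1


def path_number(adj):
    """Minimum number of internally node disjoint paths over all node pairs."""
    return min(node_disjoint_paths(adj, s, t) for s, t in combinations(adj, 2))
```

Under these assumptions, a process would compare path_number of its explored subgraph against k + 1; a complete graph on four nodes yields 3 and a four-node ring yields 2.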

Lemma 6. Any computation of a detector program contains a state where a 
Byzantine process is detected only if there indeed is a Byzantine process in the 
system. 

Proof: A non-faulty process sets detect to true if it encounters divergent
information about some node's neighborhood or when it detects that path_number
is less than k + 1. However, a non-faulty process never modifies the neighborhood
information about other processes. Hence, if the program does not have a faulty
process, all the information about a particular neighborhood that is circulated
in the system is identical. Also, according to Lemma 5 if there are no faulty
processes in the system, the path_number never falls below k + 1. Hence, detect
is set to true only if indeed the system contains a faulty process. □

Theorem 5. Detector is an adjacent-edge complete solution to the weak topology
discovery problem in case the connectivity of the system topology graph exceeds
the number of faults.

Proof: To prove the theorem we show that every computation of Detector 
conforms to the properties of the problem. We then show that the discovered 
topology is adjacent-edge complete. 

Termination property follows from Lemma 3, safety from Lemma 4, while
validity follows from Lemma 6. Notice that Lemma 3 states that unless a fault is
detected, the neighborhood of every non-faulty process is added to COR. That
is, edges adjacent to a non-faulty process are detected by every non-faulty
process. Thus, Detector is adjacent-edge complete. Hence the theorem. □

Efficiency evaluation. Since we consider an asynchronous model, the number
of messages a Byzantine process can send in a computation is infinite. To evaluate
the efficiency of Detector we assume that each process knows an upper
bound on the number of processes in the system and this upper bound is in O(n).
A non-faulty process then detects a fault if the number of processes it explores
exceeds this bound or if it receives more than one identical message from the
same neighbor. We assume that the process stops and does not send or receive
any more messages if it detects a fault.

In this case we can estimate the number of messages that are received by
non-faulty processes before one of them detects a fault or before the computation
terminates. To make the estimation fair, we assume that the unit is log(n) bits.
Since it takes that many bits to assign unique process identifiers to n processes,
we assume that one identifier is exactly one unit of information. A message in
Detector carries up to δ + 1 identifiers, where δ is the maximum number of nodes
in the neighborhood of a process. Observe that a process can receive at most n
messages from each incoming channel. Thus, the total number of messages that
can be sent by Detector is 2en, where e is the number of edges in the graph.
The message complexity of the program is in O(2enδ). If e is proportional to n^2,
then the complexity of the program is in O(δn^3).
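The arithmetic behind this estimate can be laid out step by step. The derivation below only restates the bounds given in the text, using the same symbols (n processes, e edges, δ the maximum neighborhood size); the count of 2e directed channels is our reading of the factor 2 in 2en.

```latex
\begin{align*}
\text{messages} &\le \underbrace{2e}_{\text{directed channels}} \cdot
                     \underbrace{n}_{\text{messages per channel}} = 2en, \\
\text{units}    &\le 2en\,(\delta + 1) \in O(en\delta), \\
e \in \Theta(n^2) &\implies \text{units} \in O(\delta n^3).
\end{align*}
```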

5 Explorer 

Outline. The main idea of Explorer is for each process to collect information
about some node's neighborhood such that the information goes along more than
twice as many paths as the maximum number of Byzantine nodes. As long as the
paths are node-disjoint, the information is correct if it comes across the majority
of the paths. In this case the recipient is in possession of confirmed information.
It turns out that the topology information does not have to come directly from
the source. Instead it can come from processes with confirmed information. The
detailed description of Explorer follows.

To simplify the presentation, we describe and prove correct the version of
Explorer that tolerates only one Byzantine fault. We describe how this version
can be extended to tolerate multiple faults at the end of the section.

Description. Since we first describe the 1-fault tolerant version of Explorer we
assume that the graph is 3-connected. The program is shown in Figure 3. Similar
to Detector, each process p in Explorer stores the ids of its immediate neighbors.
Process p maintains the variable start, whose function is to guard the execution
of the action that initiates the propagation of p's own neighborhood. Unlike
Detector, however, p maintains two sets that store the topology information
of the network: uTOP and cTOP. Set uTOP stores the topology data that
is unconfirmed; cTOP stores confirmed topology data. Set uTOP contains the
tuples of neighborhood information that p received from other nodes. Besides
the process id and the set of its neighbor ids, each such tuple contains a set of
identifiers of the processes that relayed the information. We call it the visited
set. The tuples in cTOP do not require a visited set.

Processes exchange messages where, along with the neighbor identifiers for
a certain process, a visited set is propagated. A process contains two actions:
init and accept. The purpose of init is similar to that in the process of
Detector. Action accept receives the identifier of some process r, its
neighborhood R, and the set S of nodes that relayed this information. The
information is received from p's neighbor q.

First, accept checks if the information about r is already confirmed. If so,
the only manipulation is to record the received information in uTOP. Actually,
this update of uTOP is not necessary for the correct operation of the program,
but it makes its proof of correctness easier to follow.

If the received information does not concern an already confirmed process, accept
checks if this information differs from what is already recorded in uTOP either
in r or in R. In either case the information is broadcast to all neighbors of p.
Before broadcasting, p appends the sender q to the visited set S.

If the information about r and R has already been received and recorded
in uTOP, accept checks if the previously recorded information came along an
internally node disjoint path. If so, the information about r is added to cTOP. In
this case, this information is also broadcast to all p's neighbors. Note, however,
that p is now sure of the information it received. Hence, the visited set of nodes
in the broadcast message is empty.



process p
const
    P, set of neighbor identifiers of p
parameter
    q : P
var
    start : boolean, initially true, controls sending of p's neighbor ids
    cTOP : set of tuples, initially {(p, P)},
           (process id, neighbor id set) confirmed topology info
    uTOP : set of tuples, initially ∅,
           (process id, neighbor id set, visited id set)
           unconfirmed topology info

*[
    init: start →
        start := false,
        (∀j : j ∈ P : send (p, P, ∅) to j)
  []
    accept: receive (r, R, S) from q →
        if (∀t, T : (t, T) ∈ cTOP : t ≠ r) then
            if (∀t, T, U : (t, T, U) ∈ uTOP : t ≠ r ∨ T ≠ R) then
                (∀j : j ∈ P : send (r, R, S ∪ {q}) to j)
            elsif (∃t, T, U : (t, T, U) ∈ uTOP :
                   t = r ∧ R = T ∧ ((U ∩ (S ∪ {q})) ⊆ {r}))
            then
                cTOP := cTOP ∪ {(r, R)},
                (∀j : j ∈ P : send (r, R, ∅) to j)
        uTOP := uTOP ∪ {(r, R, S ∪ {q})}
]



Fig. 3. Process of Explorer 
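For readers who prefer executable notation, the process in Figure 3 can be transcribed roughly as follows. This Python sketch is our own rendering under stated assumptions: the class name ExplorerProcess and the outbox list (standing in for the outgoing channels) are hypothetical, cTOP is stored as a dictionary, and atomic action execution and message transport are not modeled.

```python
class ExplorerProcess:
    """Hypothetical single-process model of the 1-fault Explorer of Figure 3."""

    def __init__(self, pid, neighbors):
        self.pid = pid
        self.P = frozenset(neighbors)
        self.start = True
        self.cTOP = {pid: self.P}   # confirmed: process id -> neighbor set
        self.uTOP = set()           # unconfirmed: (id, neighbors, visited)
        self.outbox = []            # (destination, message) pairs

    def _broadcast(self, msg):
        for j in sorted(self.P):
            self.outbox.append((j, msg))

    def init(self):
        """Announce p's own neighborhood once, with an empty visited set."""
        if self.start:
            self.start = False
            self._broadcast((self.pid, self.P, frozenset()))

    def accept(self, q, r, R, S):
        """Handle (r, R, S) received from neighbor q."""
        R = frozenset(R)
        visited = frozenset(S) | {q}
        if r not in self.cTOP:
            # Visited sets of earlier messages carrying the same (r, R).
            same = [U for (t, T, U) in self.uTOP if t == r and T == R]
            if not same:
                # New information: relay it with the sender appended.
                self._broadcast((r, R, visited))
            elif any((U & visited) <= {r} for U in same):
                # Second copy over an internally node disjoint path: confirm.
                self.cTOP[r] = R
                self._broadcast((r, R, frozenset()))
        self.uTOP.add((r, R, visited))
```

As in the figure, a message whose visited set overlaps every earlier visited set (outside r itself) is recorded but neither relayed nor used for confirmation.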



Correctness proof. Just like for the Detector algorithm, we are focusing on the
propagation of the neighborhood information A of a single non-faulty process
a. Notice that we use A to denote the correct neighborhood info. We use A' for
the neighborhood information of a that may not necessarily be correct.

To aid us in the argument, we introduce an auxiliary set SENT to be
maintained by each process. Since this set does not restrict the behavior of
processes, we assume that the Byzantine process maintains this set as well. SENT
contains each message sent by the process throughout the computation. Notice that
uTOP records every message received by the process in the computation. Hence,
the comparison of uTOP and SENT allows us to establish the channel contents.

Since a message cannot be received without being sent, and every sent message
is either still in the channel or has been received, the following proposition
states the corresponding invariant.

Proposition 1. The following predicate is an invariant of the Explorer pro- 
gram. 

(∀b, non-faulty c, A', V : c ∈ B :
    (((a, A', V) ∈ Ch.b.c) ∨ ((a, A', V ∪ {b}) ∈ uTOP.c))
    ⇔ ((a, A', V) ∈ SENT.b))

The predicate states that for any process b and its non-faulty neighbor c the
information about the neighborhood of a is recorded in SENT.b if and only if
this information is en route from b to c or is recorded in uTOP.c with b appended
to the set of visited nodes V.

Before we proceed with the correctness argument we have to introduce
additional notation. We say that some process c confirms (a, A') if it adds this
tuple to cTOP.c. We view the propagation of A' as construction of a tree of
processes that relayed A'. This tree carries A'. A tree contains two types of
nodes: a root and non-root. If process c is non-root, then for some V,
(a, A', V) ∈ SENT.c and (a, A', V) ∈ uTOP.c. That is, a non-root is a process that
forwarded the information received from elsewhere without alteration. If c is a
root, then (a, A', V) ∈ SENT.c but (a, A', V) ∉ uTOP.c. Node c's ancestor in a
tree is a node that lies on the path from c to the root.

Observe that the root of a tree can only be the process a itself, the Byzantine
node or a node that confirms (a, A'). Notice also that since each non-faulty
process c sends a message about a's information at most twice, c can belong to
at most two trees. Moreover, c has to be the root of one of those trees.

The proposition below follows from Proposition 1. 

Proposition 2. If some process d is the ancestor of another process c in a tree
carrying (a, A') and (a, A', V) ∈ uTOP.c, then d ∈ V.

Lemma 7. If a non-faulty node c confirms (a, A'), then A' = A and a is real. 

Proof: Let us first suppose that a is real. Further, suppose c is the first
non-faulty process in the system, besides a, to confirm (a, A'). To add (a, A') to
cTOP.c any process c ≠ a has to contain (a, A', V) ∈ uTOP.c and receive a
message from one of its neighbors b carrying (a, A', V') such that V ∩ V' ⊆ {a}.
In our notation this means that c belongs to a tree that carries (a, A') and
receives a message from b (possibly belonging to a different tree) that carries
the same information: (a, A'). Let us consider if b and c belong to the same or
different trees.

Suppose b and c belong to the same tree. If this is the case the messages
that c receives have to share nodes in the visited sets V and V'. However, for
c to confirm (a, A') the intersection of V and V' has to be a subset of {a}.
That is, the only common node between the two sets is a. Observe that a does
not forward the information about its own neighborhood if it receives it from
elsewhere. Thus, if a belongs to a tree then a is its root. In this case A' = A.

Suppose b and c belong to different trees. Recall that for c to confirm (a, A'),
both of these trees have to carry (a, A'). However, if A' ≠ A then the root of
the tree is either the faulty node or another node that confirmed (a, A'). Yet, we
assumed that c is the first node to do so. Thus, if c receives a message from b,
the only tree that carries the information (a, A') such that A' ≠ A is rooted in
the faulty node. Thus, even if b and c belong to different trees, A' = A.

Similarly, if a is fake, unless another node confirms (a, A') there is only one
tree that carries (a, A') and it is rooted in the faulty node. In this case, no other
node confirms (a, A'). □

Lemma 8. Every computation of Explorer contains a state where each
non-faulty process belongs to at least one tree carrying (a, A).

Proof: We prove the lemma by induction on the number of nodes in the
system. To prove the base case we observe that the init action is enabled in a in
the beginning of every computation. This action stays enabled unless executed.
Thus, due to the weak fairness of action execution assumption, init is eventually
executed in a. When it is executed, a forms a tree carrying (a, A).

Let us assume that there are i non-faulty nodes, 1 ≤ i < n, that belong to
trees carrying (a, A). Since the network is at least 3-connected, there exists a
non-faulty process c that does not belong to such a tree but has a neighbor b
that does.

If b belongs to a tree carrying (a, A) then SENT.b contains an entry (a, A, V)
for some set of visited nodes V. If c does not belong to such a tree then, by
definition, (a, A, V) ∉ uTOP.c. In this case, according to Proposition 1, Ch.b.c
contains (a, A, V). A similar argument applies to the other neighbors of c that
belong to trees carrying (a, A). That is, c has incoming messages from every
such neighbor.

According to the fair message receipt assumption, these messages are
eventually received. We can assume, without loss of generality, that c receives a
message from b first. Since c does not contain an entry (a, A, V) in uTOP.c, upon
receipt of the message from b, c sends a message with (a, A, V ∪ {b}), adds this
message to SENT.c and includes it in uTOP.c. This means that c joins the tree
carrying (a, A).

Thus, every non-faulty node eventually joins a tree carrying correct neigh- 
borhood information about a. □ 

A branch of a tree is either a subtree without the root or the root process
alone. The following proposition follows from Proposition 1.

Proposition 3. If a computation of Explorer contains a state where a
non-faulty node c and its neighbor b either belong to two different trees carrying
the same information (a, A) or to two different branches of the tree rooted in a,
then this computation also contains a state where c confirms (a, A).

Lemma 9. Every non-faulty process c eventually confirms (a, A). 

Proof: The proof is by induction on the number of nodes in the system. The
base case trivially holds as a itself confirms (a, A) in the beginning of every
computation. Assume that i non-faulty processes have (a, A) in cTOP, where
1 ≤ i < n. We show that if there exists another non-faulty process c, it eventually
confirms (a, A). Two cases have to be considered: there exists only one tree
carrying (a, A), and there are multiple such trees.

Let us consider the first case. Notice that in every computation there
eventually appears a tree rooted in a. In this case, we may only consider a tree so
rooted. Since the network is at least 3-connected, there exists a simple cycle
containing a and not containing the faulty process. According to Lemma 8, every
process in the cycle eventually joins this tree. Observe that, by our definition
of a tree branch, there always is a pair of neighbor processes b and c that
belong to different branches of a tree rooted in a and carrying (a, A). In this case,
according to Proposition 3, one of the two nodes eventually confirms (a, A).

Let us now consider the case of multiple trees carrying (a, A). Again,
according to Lemma 8, each non-faulty process in the system joins at least one of
these trees. Since the network is at least 3-connected there exists a non-faulty
process c belonging to one tree that has a neighbor b belonging to a different
tree. In this case, according to Proposition 3, c confirms (a, A).

By induction, every non-faulty process in the system eventually confirms
(a, A). □

Theorem 6. Explorer is a two-adjacent-edge complete solution to the strong
topology discovery problem in case of one fault, provided the system topology
graph is at least 3-connected.

Proof: Explorer conforms to the termination and safety properties of the 
problem as a consequence of Lemmas 9 and 7 respectively. 

Observe that a non-faulty node may potentially confirm incorrect neighbor- 
hood information about a Byzantine node. That is, an edge reported by the 
faulty process is either missing or fake. However, due to the two above lem- 
mas, if two nodes are non-faulty the information whether there is an adjacent 
edge between them is discovered by every non-faulty node. Hence Explorer is 
two-adjacent-edge complete. □ 

Modification to handle k > 1 faults. Observe that Explorer confirms the
topology information about a node's neighborhood when it receives two
messages carrying it over internally node disjoint paths. Thus, the program can
handle a single Byzantine fault. Explorer can handle k > 1 faults if it waits
until it receives k + 1 messages before it confirms the topology info. All the
messages have to travel along internally node disjoint paths. For the correctness of
the algorithm, the topology graph has to be (2k + 1)-connected.

Proposition 4. Explorer is a two-adjacent-edge complete solution to the strong
topology discovery problem in case of k faults, provided the system topology
graph is at least (2k + 1)-connected.
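The text does not spell out how a process would check that k + 1 messages travelled along internally node disjoint paths. One straightforward possibility, shown here purely as an assumption, is to search the recorded visited sets for k + 1 that are pairwise disjoint once the source r is set aside. The function name can_confirm is hypothetical, and the exhaustive search over combinations is exponential in the worst case; it is meant to state the condition, not to be efficient.

```python
from itertools import combinations


def can_confirm(visited_sets, r, k):
    """True if k + 1 of the visited sets are pairwise disjoint apart from r.

    visited_sets holds the visited sets of all messages received so far that
    carry the same (r, R); r is the process whose neighborhood is described.
    """
    paths = [frozenset(v) - {r} for v in visited_sets]
    for group in combinations(paths, k + 1):
        if all(a.isdisjoint(b) for a, b in combinations(group, 2)):
            return True
    return False
```

For k = 1 this reduces to the condition in Figure 3: two visited sets whose intersection is a subset of {r}.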

Efficiency evaluation. Unlike Detector, Explorer does not quit when a fault is
discovered. Thus, the number of messages a faulty node may send is arbitrarily
large. However, we can estimate the message complexity of Explorer in the
absence of faults. Each message carries a process identifier, a neighborhood of this
process and a visited set. The number of the identifiers in a neighborhood is no
more than δ, and the number of identifiers in the visited set can be as large as
n. Hence the message size is bounded by δ + n + 1 which is in O(n).

Notice that for the neighborhood A of each process a, every process
broadcasts a message twice: when it first receives the information, and when it
confirms it. Thus, the total number of sent messages is 4e·n and the overall message
complexity of Explorer if no faults are detected is in O(n^4).
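The steps behind this bound can be written out as follows. The derivation uses only the quantities named in the text; the observation that one broadcast round by all processes costs 2e messages (the sum of all node degrees) and the bound e ≤ n(n − 1)/2 are standard facts added here to connect 4e·n to O(n^4).

```latex
\begin{align*}
\text{message size} &\le \delta + n + 1 \in O(n), \\
\text{messages per neighborhood} &\le 2 \cdot \sum_{p} \deg(p) = 2 \cdot 2e = 4e, \\
\text{messages in total} &\le 4e \cdot n, \\
\text{units} &\le 4en\,(\delta + n + 1) \in O(e n^2), \\
e \le \tfrac{n(n-1)}{2} &\implies \text{units} \in O(n^4).
\end{align*}
```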

6 Composition and Extensions 

Composing Detector and Explorer. Observe that Detector has better
message complexity than Explorer if the neighborhood size is bounded. Hence, if the
incidence of faults is low, it is advantageous to run Detector and invoke Explorer
only if a fault is detected. We assume that the processes can distinguish between
message types of Explorer and Detector. In the combined program, a process
running Detector switches to Explorer if it discovers a fault. Other processes
follow suit when they receive their first Explorer messages. They ignore
Detector messages henceforth. A Byzantine process may potentially send an Explorer
message as well, which leads to the whole system switching to Explorer. Observe
that if there are no faults, the system will not invoke Explorer. Thus, the
complexity of the combined program in the absence of faults is the same as that of
Detector. Notice that even though Detector alone only needs (k + 1)-connectivity
of the system topology, the combined program requires (2k + 1)-connectivity.
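The switching rule can be pictured as a two-mode dispatcher. The Python below is only a sketch of the rule as described in this paragraph; the class name CombinedProcess, the message kinds "detector" and "explorer", and the returned handler names are hypothetical placeholders, and the handlers themselves are not modeled.

```python
class CombinedProcess:
    """Hypothetical mode switch for a process running the combined program."""

    def __init__(self):
        self.mode = "detector"

    def on_fault_detected(self):
        # A process running Detector switches to Explorer on detecting a fault.
        self.mode = "explorer"

    def on_message(self, kind, payload):
        """Return the name of the handler to run, or None if the message is ignored."""
        if kind == "explorer":
            # The first Explorer message makes the process follow suit.
            self.mode = "explorer"
            return "explorer.accept"
        if self.mode == "detector":
            return "detector.accept"
        return None  # Detector messages are ignored after the switch
```

Note that the switch is one-way, which is why a single Explorer message from a Byzantine process is enough to move the whole system to Explorer.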

Message Termination. We have shown that Detector and Explorer comply
with the functional termination properties of the topology discovery problem.
That is, all processes eventually discover topology. However, the performance
aspect of termination, viz. message termination, is also of interest. Usually an
algorithm is said to be message terminating if all its computations contain a
finite number of sent messages [4].

However, a Byzantine process may send messages indefinitely. To capture
this, we weaken the definition of message termination. We consider a
Byzantine-tolerant program message terminating if the system eventually arrives at
a state where: (a) all channels are empty except for the outgoing channels of a
faulty process; (b) all actions in non-faulty processes are disabled except for
possibly the receive-actions of the incoming channels from Byzantine processes,
and these receive-actions do not update the variables of the process. That is, in
a terminating program, each non-faulty process eventually starts to discard
messages it receives from its Byzantine neighbors.

Making Detector terminating is fairly straightforward. As one process detects
a fault, the process floods the announcement throughout the system. Since the
topology graph for Detector is assumed (k + 1)-connected, every process receives
such an announcement. As the process learns of the detection, it stops processing
or forwarding messages. Notice that the initiation of the flood by a Byzantine
node itself only accelerates the termination of Detector as the other processes
quickly learn of the faulty node's existence.

The addition of termination to Explorer is more involved. To ensure termi- 
nation, restrictions have to be placed on message processing and forwarding. 
However, the restrictions should be delicate as they may compromise the live- 
ness properties of the program. By the design of Explorer, each process may 
send at most one message about its own neighborhood to its neighbors. Hence, 
the subsequent messages can be ignored. However, a faulty process may send 
messages about neighborhoods of other processes. These processes may be real 
or fake. We discuss these cases separately. 

Note that each process in Explorer can eventually obtain an estimate of the
identities of the processes in the system and disregard fake process information.
Indeed, a path to a fake node can only lead through faulty processes. Thus, if
a process discovers that there may be at most k internally node disjoint paths
between itself and a certain node, this node is fake. Therefore, the process may
cease to process messages about the fake node's neighborhood. Notice that
since the system is (2k + 1)-connected, messages about real nodes will always be
processed. Therefore, the liveness properties of Explorer are not affected.

As to the real processes, they can be either Byzantine or non-faulty. Recall 
that each non-faulty process of Explorer eventually confirms neighborhoods of 
all other non-faulty processes. After the neighborhood of a process is confirmed, 
further messages about it are ignored. 

The last case is a Byzantine process u sending a message to its correct
neighbor v about the neighborhood of another Byzantine process w. By the design of
Explorer, v relays the message about w provided that the neighborhood
information about w differs from what was previously received about w. As we
discussed above, eventually v estimates the identities of all real processes in the
system. Therefore, there is a finite number of possible different neighborhoods of
w that u can create. Hence, eventually they will be exhausted, and v starts
ignoring further messages from u about w.

Thus, Explorer can be made terminating as well. 

Handling topology updates. In the topology discovery problem statement,
it is assumed that the system topology does not change. However, Detector and
Explorer can be adapted to manage topology changes as well. There are two
aspects of topology change: the notification and the transport. For notification,
a node should inform the others of its most up-to-date neighborhood. The
transport aspect should ensure that this notification is delivered to all nodes
despite topology changes.

We implement the transport aspect as follows. If a node p, due to the change
in topology, obtains a new neighbor q, then p sends to q the most recent
neighborhood information about all nodes that p is aware of. Thus, the most recent
information gets propagated regardless of topology changes.

The satisfaction of the notification aspect is more involved. Observe, however,
that apart from detecting fake nodes in Explorer, both algorithms propagate the
information of one process neighborhood independently of the others. We first
describe how this propagation can be done in case the topology changes and
then address the fake node detection. Each time the neighborhood of a process
p changes, p starts a new version of the topology discovery algorithm for its
neighborhood. Observe that a faulty process may also start a new version for p.

The versions are distinguished by version numbers. Each process maintains 
the version numbers of p. Each related message carries the version number. 
Each process outputs the discovered neighborhood of p with the highest received 
version number. Observe that in the case of Explorer the processes only output 
confirmed information. Notice that if a faulty process sends incorrect information 
about p's neighborhood with a certain version number, this incorrect info will 
be handled by the basic Detector or Explorer within that version. For example, 
the faulty messages of version i about p's neighborhood will be countered by the 
correct messages of the same version. Notice that a faulty process in Explorer 
may start a version j for p's neighborhood such that it is higher than the highest 
version i that p itself started. However, according to the basic Explorer, the 
incorrect information in version j will not be confirmed. 
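The output rule described above can be sketched in a few lines. The function name output_neighborhood and the record layout (version, neighbors, confirmed) are assumptions introduced for illustration; the flag require_confirmed distinguishes Explorer, which outputs only confirmed information, from Detector.

```python
def output_neighborhood(records, require_confirmed):
    """Pick the neighborhood of p with the highest version number.

    records is an iterable of (version, neighbors, confirmed) tuples, all
    about the same process p. With require_confirmed=True (Explorer), only
    confirmed records are eligible. Returns None if nothing is eligible.
    """
    candidates = [(version, neighbors)
                  for version, neighbors, confirmed in records
                  if confirmed or not require_confirmed]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]
```

With records for versions 1 and 2 confirmed and a forged version 5 unconfirmed, the Explorer-style call returns the version 2 neighborhood, matching the observation that a forged higher version is never confirmed.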

There are two specific modifications to the basic Detector. If the faulty
process sends a message concerning p with the version number higher than that of
p, p itself detects the fault. To detect fake nodes generated by a faulty process,
each node has to compile the topology graph TOP of the highest version number
for each node in the system and ensure that its connectivity does not fall below
k + 1. Observe that Detector is unable to distinguish a temporary lack of
connectivity from malicious behavior of the faulty nodes. Therefore, the
connectivity of the discovered network at each node should never fall below k + 1. For
that, we assume that throughout a computation the intersection of all system
topologies is (k + 1)-connected. This assumption is not necessary for Explorer.

The notification mechanism can be optimized in obvious ways. For Detector, 
each process has to keep the information for p with only the highest version 
number. Obsolete information can be safely discarded. For Detector, the process 
may keep the latest version of confirmed neighborhood information. Observe 
that this extension of the topology discovery algorithms assumes infinite-size 
counters. Care must be taken when implementing these counters in the actual 
hardware, as the faulty processes may try to compromise topology discovery 
if the counter values are reused. Hence, such an implementation would require 
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a Byzantine-robust counter synchronization algorithm. Lamport and MeUiar- 
Smith [10] proposed such algorithm for completely connected systems. Extending 
it to arbitrary topology systems is an attractive avenue of future research. 
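
The bookkeeping implied by version numbers can be sketched as follows. This is a hypothetical illustration, not the paper's program: the class `NeighborhoodStore`, its method `record`, and the returned status strings are assumptions of the example. Treating two different neighborhoods with the same version as a conflict is likewise an assumption about how the basic Detector would react. Python integers are unbounded, which matches the infinite-size counter assumption; a fixed-width counter would need the synchronization discussed above.

```python
class NeighborhoodStore:
    """Keeps, for each process, only the highest-version neighborhood seen."""

    def __init__(self, own_id):
        self.own_id = own_id
        self.own_version = 0  # version of our own latest announcement
        self.latest = {}      # process id -> (version, frozenset of neighbors)

    def record(self, subject, version, neighbors):
        """Process a claim about subject's neighborhood and report the outcome."""
        neighbors = frozenset(neighbors)
        # Only we start versions about ourselves, so a higher one is forged.
        if subject == self.own_id and version > self.own_version:
            return "fault"
        current = self.latest.get(subject)
        if current is not None:
            if version < current[0]:
                return "stale"      # obsolete information is discarded
            if version == current[0] and neighbors != current[1]:
                return "conflict"   # same version, different content
        self.latest[subject] = (version, neighbors)
        return "stored"
```

Storing one entry per process keeps memory proportional to the size of the discovered topology, regardless of how many updates have been issued.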

Discovering neighbors. As described, in the initial state of Detector and 
Explorer, each process has access to correct information about its immediate 
neighborhood. Note that, in general, obtaining this information in the presence 
of Byzantine processes may be difficult as they can mount a Sybil attack [7]. In 
such an attack, a faulty process is able to send a message and put an arbitrary 
process identifier as the sender of this message. That is, a faulty process assumes 
the identity of this process. A Sybil attack is difficult to handle. However, Detector 
and Explorer can be modified to handle neighborhood discovery with known 
ports. That is, each process does not know the identities of its neighbors but can 
determine if a message is coming from the same process. 

The modified algorithms contain two phases: neighborhood discovery phase 
and topology discovery proper phase. In the first phase, each process broadcasts 
its identifier to its neighbors. Observe that faulty processes may not send these 
initial messages at all. Thus, the process should not wait for a message from 
every possible neighbor. Instead, as soon as a process p gets a message carrying 
identifier q, p may start the second phase with {q} as its neighborhood. Every 
time p gets a new distinct identity, p treats it as topology update, increments its 
counter and re-initiates the topology discovery. This procedure can be further 
streamlined. Recall that for Detector and Explorer the topology graph has to be 
respectively (k + 1)- and (2k + 1)-connected. Thus, depending on the algorithm, each 
process is guaranteed to have k + 1 or 2k + 1 non-faulty neighbors. Therefore, 
each process may delay initiating topology discovery until it gets this minimum 
number of distinct identities. 
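
The first phase with known ports can be sketched as follows. This is a minimal illustration under stated assumptions: the class `NeighborDiscovery`, the method `on_identifier`, the `suspicious_ports` set, and the representation of ports as integers are all hypothetical and are not taken from the algorithms themselves. The sketch delays the first topology announcement until the minimum number of distinct identities has arrived, and afterwards treats every new identity as a topology update with an incremented counter.

```python
class NeighborDiscovery:
    """Phase one: learn neighbor identifiers over ports that cannot be forged."""

    def __init__(self, k, algorithm):
        # Detector needs k + 1 identities, Explorer needs 2k + 1.
        self.required = k + 1 if algorithm == "detector" else 2 * k + 1
        self.port_ids = {}             # port -> first identifier announced
        self.identities = set()        # distinct identifiers seen so far
        self.suspicious_ports = set()  # ports that announced two identifiers
        self.version = 0

    def on_identifier(self, port, ident):
        """Return (version, neighborhood) to announce, or None to stay silent."""
        if port in self.port_ids:
            if self.port_ids[port] != ident:
                # One neighbor used more than one identifier on its port.
                self.suspicious_ports.add(port)
            return None
        self.port_ids[port] = ident
        if ident in self.identities:
            return None  # not a new distinct identity
        self.identities.add(ident)
        if len(self.identities) < self.required:
            return None  # keep waiting for the minimum number
        self.version += 1
        return self.version, frozenset(self.identities)
```

How a process reacts to a suspicious port is left open here, since the text only states that a second identifier on the same port can be detected.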

Observe that due to known ports a faulty process may not be able to use more 
than one identifier per neighbor without being detected. However, the modified 
algorithms may not be able to determine the identifier of a faulty process as 
it may select an arbitrary one, including the identifier of an already existing 
process. Thus, a pair of colluding faulty nodes may deceive their non-faulty 
neighbors into believing that they share an edge. This behavior is illustrated 
in Figure 4. When communicating to a non-faulty node a, its faulty neighbor b 
assumes the identity of another non-faulty node d. Similarly, a faulty neighbor 
c of d assumes the identity of a. This way, non-faulty nodes a and d are led to 
believe they share an edge. 

Other extensions. Observe that Explorer is designed to disseminate the infor- 
mation about the complete topology to all processes in the system. However, it 
may be desirable to just establish the routes from all processes in the system to 
one or a fixed number of distinguished ones. To accomplish this Explorer needs 
to be modified as follows. No neighborhood information is propagated. Instead 
of the visited set, each message carries the propagation path of the message. 
That is, the order of the relays is significant. 
[Figure 4: faulty node b pretends to be d, faulty node c pretends to be a; the forged link joins a and d.] 

Fig. 4. Faulty nodes b and c forge a link between non-faulty nodes a and d. 

Only the distinguished processes initiate the message propagation. The other 
processes only relay the messages. Just as in the original Explorer, a process 
confirms a path to another process only if it receives 2k + 1 internally 
process-disjoint paths from the source or from other confirming processes. Again, like in 
Explorer, such a process rebroadcasts the message, but empties the propagation 
path. In the outcome of this program, for every distinguished process, each non- 
faulty process will contain paths to at least 2k + 1 processes that lead to this 
distinguished process. Out of these paths, at least k + 1 ultimately lead to the 
distinguished process. 
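
The path-carrying variant can be sketched as follows. This is an illustrative sketch only: the functions `relay`, `has_disjoint_paths` and `confirms` are hypothetical names, each path is assumed to be a tuple of relay identifiers excluding the source and the receiver, and the exhaustive search for disjoint paths is an assumption of the example, since the text does not prescribe how disjointness is tested. The search is exponential in the worst case and is meant for small inputs.

```python
def relay(path, me):
    """Append `me` to the propagation path, or drop a message seen before."""
    if me in path:
        return None
    return path + (me,)


def has_disjoint_paths(paths, needed):
    """True if `needed` of the given paths are pairwise internally disjoint."""
    # Only the set of relays matters for disjointness; duplicates count once.
    sets = list({frozenset(p) for p in paths})

    def search(start, used, remaining):
        if remaining == 0:
            return True
        for i in range(start, len(sets)):
            if used.isdisjoint(sets[i]) and search(
                    i + 1, used | sets[i], remaining - 1):
                return True
        return False

    return search(0, frozenset(), needed)


def confirms(paths, k):
    """A process confirms once it holds 2k + 1 internally disjoint paths."""
    return has_disjoint_paths(paths, 2 * k + 1)
```

For example, with k = 1 the paths `("a",)`, `("b", "c")` and `("e",)` share no relay, so a process holding them would confirm.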

In Explorer, for each process the propagation of its neighborhood information 
is independent of the other neighborhoods. Thus, instead of topology, Explorer 
can be used for efficient fault-tolerant propagation of arbitrary information from 
the processes to the rest of the network. 

7 Conclusion 

In conclusion, we would like to outline a couple of interesting research direc- 
tions. The existence of Byzantine-robust topology discovery solutions opens the 
question of theoretical limits of efficiency of such programs. The obvious lower 
bound on message complexity can be derived as follows. Every process must 
transmit its neighborhood to the rest of the nodes in the system. Transmitting 
information to every node requires at least n messages, so the overall message 
complexity is at least δn². If k processes are Byzantine, they may not relay the 
messages of other nodes. Thus, to ensure that other nodes learn about its neigh- 
borhood, each process has to send at least k + 1 messages. Thus, the complexity 
of any Byzantine-robust solution to the topology discovery problem is 
in Ω(δn²k). 
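
The counting argument can be written out step by step. In the sketch below, δ is assumed to denote the size of a neighborhood description, so that sending one neighborhood to one node costs δ, and n is the number of processes. Both readings are assumptions of this worked example and are not stated in this section.

```latex
\begin{align*}
\text{one process, fault-free:} \quad & \delta \cdot n \\
\text{all } n \text{ processes, fault-free:} \quad & n \cdot \delta n = \delta n^{2} \\
\text{with } k \text{ Byzantine relays:} \quad & (k + 1) \cdot \delta n^{2} \in \Omega(\delta n^{2} k)
\end{align*}
```

The factor k + 1 reflects that up to k of the copies sent by a process may be dropped by faulty relays, so at least one more copy is needed to reach the rest of the network.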

Observe that Explorer and Detector may not explicitly identify faulty nodes 
or the inconsistent views of their immediate neighborhoods. We believe that 
this identification can be accomplished using the technique used by Dolev [6]. 
In case there are 3k + 1 non-faulty processes, they may exchange the topologies 
they collected to discover the inconsistencies. This approach may potentially 
expedite termination of Explorer at the expense of greater message complexity: 
if a certain Byzantine node is discovered, the other processes may ignore its 
further messages. 